Abstract


In this paper we investigate the robustness of several deadlock
detection algorithms for distributed computing systems. We analyze
the behavior of each algorithm in the presence of two classes of
failures: lost messages and single site failures. In the case of
single site failure we consider six different types of sites, depending
on how they can participate in deadlock and deadlock detection. The
observations and conclusions made in this paper are intended to show
how robust the present algorithms are and to provide insight into, and
a better understanding of, the robustness of distributed algorithms.


I.  INTRODUCTION. 

There have been many algorithms published for deadlock detection,
prevention or avoidance in centralized multiprogramming systems. The
problem of deadlock in those systems has been essentially solved. In
the past decade there has been considerable work done on distributed
computer networks and multiprocessor systems. Both of these are
predecessors of distributed computing systems, which are presently a
focus of intensive research and development in academia and industry.
Many techniques for concurrency control, reliability/recovery or
security developed for centralized (or single CPU) systems have been or
are being adopted and adapted for distributed computing systems. For
example, there is a tendency to use locking as a general
synchronization technique in distributed systems and its special
variant, two-phase locking, for distributed database systems. Up until
recently it has been argued that the frequency of deadlock occurrence
in existing applications is so low that the problem of deadlock in
distributed systems is not very important and therefore can be managed
by adopting techniques developed for centralized systems. However, it
has recently become apparent that deadlocks may be a problem in the
future as we see new applications featuring large processes and/or many
concurrent processes or transactions [GRA81]. As an example of such new
applications we mention information utility systems which concurrently
service hundreds or perhaps thousands of TV users.

Distributed computing systems are characterized by the absence
of global memory and by message transmission delays which are
not negligible. Additionally, the processes operating at the same or
different sites can communicate with each other, and can share
resources. If locking is used as the synchronization technique, then
the last two items raise the problem of deadlock occurrence in
distributed systems, and the first two characteristics of distributed
systems make deadlock much more difficult to detect, avoid or prevent
than in the earlier centralized multiprogramming computing systems.

Deadlock prevention and avoidance algorithms for distributed
computing systems are not efficient. Prevention can be accomplished
by not allowing concurrent processing, by assigning priorities and
allowing preemption, by requiring a process to acquire all resources
it will need before it starts, or by having no locks. Requiring
sequential execution in a distributed system is a gross waste of
resources. Having prioritized processes will result in lower-priority
processes being restarted many times, with a major degradation in
system efficiency. Dynamic prioritization would be a complex
algorithm by itself. A process may be unable to determine its minimum
set of resources, and therefore would have to acquire the set of all
probable and possible resources, even though it may not need them. In
addition, in systems in which messages are treated as resources, it is
impossible to determine in advance which messages will be required.
Having no locks may result in database inconsistencies, assuming a
non-optimistic concurrency controller. Similarly, deadlock avoidance
algorithms, which either calculate a 'safe path' [GCIT77] or never wait
for a lock [GRA78], are also inefficient. Safe path algorithms require a
non-trivial execution time, and must be run each time a resource
request is to be granted. Never waiting for a lock is inefficient
when deadlock is a rare occurrence. Thus, in distributed computing
systems, deadlock detection and resolution algorithms must be used.

There are four criteria that any deadlock detection algorithm for
distributed systems should meet: 1) correctness, 2)
robustness, 3) performance, and 4) practicality. Correctness refers
to the ability of the algorithm to detect all deadlocks, and the
ability to not detect any false deadlocks. Robustness refers to the
ability of the algorithm to be correct even in the presence of
anticipated faults. This includes the ability to detect deadlocks even
when a site fails or loses communications while the deadlock detection
algorithm is being executed. The performance of the algorithm refers to
its overhead: the delays between deadlock and detection, CPU time
used, number of messages required, etc. Practicality is closely
related to performance. It refers to aspects such as complexity and
cost.

Several different approaches are being used in current deadlock
detection and resolution algorithms for distributed systems. The two
major ones are centralized and distributed deadlock detection
algorithms. Within the distributed class are two subclasses: 1) all or
several sites execute the deadlock detection algorithm, and 2) only one
site is actually executing, although the algorithm is resident in all
sites and thus any site could execute the algorithm. It might be easier
to view the algorithms as a continuum: fully centralized [GRA78],
hierarchical [MEN79], distributed with a single site at a time
executing the algorithm [GOL77], distributed with all sites involved in
a possible deadlock executing the algorithm concurrently [MEN79], and
distributed with all sites executing the algorithm
concurrently [ISL78].

In this paper we investigate the robustness of several published
deadlock detection and resolution algorithms for distributed systems.
The motivation for our work comes from three facts. First, very few
authors have investigated the robustness or reliability of deadlock
detection algorithms. Second, reliable deadlock detection and
resolution for upcoming new distributed systems and applications is in
our opinion an urgent, very important and as yet not satisfactorily
resolved problem. Third, as there can be more than one deadlock being
detected by the deadlock detection algorithm, it is reasonable to
expect such an algorithm to be robust, i.e., to continue executing and
detecting all deadlocks even in the presence of failure(s) which might
in effect have created one of the deadlocks being detected.

The paper is organized as follows. In section two, we discuss
robustness of distributed systems. In section three, we analyze the
robustness of several existing deadlock detection algorithms with
respect to some single failures. In section four, we present our
conclusions based on the analysis of section three.

II. SOME THOUGHTS ON ROBUSTNESS IN DISTRIBUTED SYSTEMS.

In this paper we want to investigate the robustness of deadlock
detection algorithms (DDA), i.e., we want to find out the impact of
some single failures on such algorithms. In general, the DDA is
invoked by one of two events: either whenever a process waits for a
resource, or after a certain period of time has elapsed since the last
DDA invocation. In the first case, deadlock is checked for whenever its
possibility appears, and in the second case it is checked for
periodically (i.e., regardless of whether its possibility exists).

The DDA can reside in one, several or all sites of the
distributed computing system. When a triggering event for the DDA
occurs, then depending on the particular algorithm one, several or all
sites will receive information from several or all sites. Such
information consists of "who waits for whom and where", and it can be
represented by arcs of the wait-for graph, strings, or lists of
processes or transactions. Upon receipt of such information one,
several or all sites attempt to reconstruct a global state of the
distributed system, i.e., to generate a true snapshot of all processes,
or of all waiting processes, in the system.
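
As a rough illustration of the "who waits for whom and where" information, the sketch below merges per-site arc reports into a single wait-for graph. The `Arc` record and the `merge_reports` helper are hypothetical names introduced here, not part of any algorithm discussed in this paper, and the sketch deliberately ignores message delays and stale reports, which are exactly what make a true snapshot hard to obtain.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class Arc(NamedTuple):
    """One 'who waits for whom and where' report (hypothetical layout)."""
    waiter: str   # blocked process or transaction
    holder: str   # process or transaction it waits for
    site: str     # site at which the wait was observed


def merge_reports(reports: Iterable[List[Arc]]) -> Dict[str, Set[str]]:
    """Union per-site arc lists into one wait-for graph (waiter -> holders)."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for arcs in reports:
        for arc in arcs:
            graph[arc.waiter].add(arc.holder)
    return dict(graph)


# Two sites each report one arc; together the arcs form a cycle.
site_d = [Arc("T1", "T4", "D")]
site_c = [Arc("T4", "T1", "C")]
global_wfg = merge_reports([site_d, site_c])
```

A receiving site would then search the merged graph for cycles; whether the merged graph reflects a state that ever existed globally is the difficulty taken up in the text.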

The generation of such a true snapshot in the distributed system
is difficult because of the lack of global memory and the message
delays, which are not negligible and can vary considerably. The
generation of such a true snapshot, usually referred to as a global
wait-for graph, becomes even more difficult when we consider the
possibility of failures in the distributed system. Some system
mechanisms have been designed to be robust or reliable. For example,
some concurrency control or synchronization mechanisms for distributed
databases and transaction processing systems are based on two phase
locking, which has been made robust by incorporating atomicity through
two phase commit protocols. The two phase commit protocol supports not
only the atomicity of transactions but also the robustness of locking,
i.e., the robustness of concurrency control mechanisms. In particular,
what makes concurrency control which uses locking robust is the need to
lock and unlock resources in a robust way, i.e., either all
lock/unlock operations for a given process or transaction occur or
none occur. Thus in some sense, the robustness of concurrency control
is meant to support the atomicity of placing and releasing a set of
locks needed by a process. In other words, the robustness of
concurrency control means that no dangling locks or locked resources
are left behind by the terminated or committed process, even in the
presence of some failures. It is interesting to note that although
deadlock detection is a part of concurrency control based on locking,
there has been no attempt to provide for or even to investigate the
robustness of deadlock detection mechanisms. The most likely
explanation for this is that from the concurrency control point of
view, the inability of the process to lock a needed resource is an
exception to be handled by another mechanism, i.e., a deadlock
detection algorithm (DDA).

The proper way to see the DDA is as another transaction running
under the concurrency control mechanism, as it reads and shares lock
tables with concurrency controllers and other transactions. However,
the DDA is a special transaction which operates on special data it
creates solely for deadlock detection, e.g., wait-for graphs. Such
data, which we will call deadlock data, is internal to each invocation
of the DDA transaction and is erased after its execution. Moreover,
such deadlock data is not shared with any other DDA transaction
invocation and therefore need not be locked. This means that the
robustness required of DDA transactions is of a somewhat different kind
than the robustness of transactions operating on shared database data.
Thus it makes sense that the DDA transaction does not need to use two
phase commit to assure its robustness. The question then is what kind
of robustness or fault-tolerance we need for DDA transactions, and this
is precisely the problem we are addressing in this paper.

We consider the following informal model of DDA transaction
execution. The DDA is invoked by a concurrency controller at a site at
which a database transaction cannot acquire locks which are being
held by other transaction(s). The DDA transaction executes at one,
several or all sites (depending on the DDA itself and the deadlock
topology). During its execution the DDA transaction should exhibit
the atomicity property, i.e., it either executes correctly or it does
not execute at all. The result of a DDA transaction execution is one of
two messages to the concurrency controller which has triggered it:

1) Proceed - because of  a) no deadlock, or
                         b) deadlock detected but another
                            transaction was selected as
                            a victim for back-up.

2) Abort - because of    a) deadlock detected and you are
                            the victim, or
                         b) DDA transaction failed, i.e.,
                            it did not execute.
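
The two replies and their four causes can be written down as a small decision function. This is only a sketch of the informal model above; the names `Reply` and `dda_reply` and the argument layout are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class Reply(Enum):
    PROCEED = "proceed"
    ABORT = "abort"


def dda_reply(executed: bool, deadlock: bool,
              victim: Optional[str], requester: str) -> Reply:
    """Reply sent to the concurrency controller that triggered the DDA."""
    if not executed:
        return Reply.ABORT        # 2b: the DDA transaction failed
    if deadlock and victim == requester:
        return Reply.ABORT        # 2a: the requester is the victim
    return Reply.PROCEED          # 1a: no deadlock, or 1b: another victim


no_deadlock = dda_reply(True, False, None, "T1")
other_victim = dda_reply(True, True, "T4", "T1")
self_victim = dda_reply(True, True, "T1", "T1")
dda_failed = dda_reply(False, False, None, "T1")
```

Note that case 2b treats a failed DDA transaction like a detected deadlock from the requester's point of view: the requester is aborted rather than left waiting.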


The situation we investigate in this paper is when DDA
transactions fail or should not fail, i.e., how robust the existing
DDA's are or should be. In this paper we consider only two classes of
single failures. First, we investigate the impact of lost messages and
second, we investigate the impact of one site failures, or equivalently
one site partitions, on DDA behavior. We investigate the impact of lost
messages because not all distributed systems may support reliable
delivery of messages, several algorithms treat messages as
resources [GOL77], and in some applications acknowledgements cannot be
sent.

III. RELIABILITY ANALYSIS OF DEADLOCK DETECTION ALGORITHMS

In this section, we examine four published deadlock detection
algorithms for distributed computing systems with respect to the
presence of the two classes of failures (lost messages and site
failures) discussed in section two. Although very few of them have
already been shown to be correct when no failures or errors occur, we
feel that their robustness is nevertheless worth analyzing. The
assumptions made by each author will be discussed in the context of how
robust the algorithm is. We will analyze each DDA by executing it in
the following scenario, in which each site has a single
resource and a single transaction. (These restrictions merely make the
example simpler; they are not required for the analysis.) The initial
system status is shown in figure 1. Transaction T1 at site A holds
resources R2 and R3 and is waiting for resource R4. Transactions T2
and T3 hold no resources. Transaction T4 at site D holds resource R4,
but is active. We assume that the deadlock detection activity
resulting from T1 waiting for R4 has been completed, so there is
currently no deadlock detection activity in the system. For the
algorithms which require global timestamps, we assign timestamp (TS) t1
to the T1<--R2 assignment, t2 to the T4<--R4 assignment, t3 to the
T1<--R3 assignment, and t4 to the T1-->R4 request. Now at some time t6,
transaction T4 requests R3, resulting in a global deadlock
T1-->T4-->T1.

[Diagram not reproduced: initial system status at Site A, Site B,
Site C and Site D]

Figure 1
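
The example state can be checked mechanically. The sketch below encodes the holdings and requests described above and derives the transaction wait-for edges before and after T4's request for R3. The dictionaries and the helper names `wait_for_edges` and `find_cycle` are illustrative assumptions, and the code assumes each transaction waits for at most one resource, as in this example.

```python
from typing import Dict, List, Optional

# Resource -> holding transaction, as described for the initial state.
holds: Dict[str, str] = {"R2": "T1", "R3": "T1", "R4": "T4"}
# Transaction -> resource it is blocked on (T1 already waits for R4).
waits: Dict[str, str] = {"T1": "R4"}


def wait_for_edges(holds: Dict[str, str], waits: Dict[str, str]) -> Dict[str, str]:
    """Derive transaction -> transaction edges from holdings and requests."""
    return {t: holds[r] for t, r in waits.items() if r in holds}


def find_cycle(edges: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow the single outgoing edge of each node; return the cycle
    through `start` if there is one, otherwise None."""
    path, node = [start], start
    while node in edges:
        node = edges[node]
        if node == start:
            return path + [start]
        if node in path:          # a cycle that does not pass through start
            return None
        path.append(node)
    return None


before = find_cycle(wait_for_edges(holds, waits), "T1")

waits["T4"] = "R3"                # the request that closes the cycle
after = find_cycle(wait_for_edges(holds, waits), "T1")
```

Here `before` is `None` because T4 is still active, while `after` is the path T1, T4, T1. This check uses a global view of the state, which no single site has; the algorithms analyzed below differ in how they assemble that view from messages.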

In the case of a site failure, we distinguish the following cases: a)
a site can have a transaction involved in a deadlock but not be
involved in deadlock detection, b) a site can have a transaction
involved in a deadlock and be involved in detection, c) a site can have
a resource involved in a deadlock and not be involved in detection, d)
a site can have a resource involved in a deadlock and be involved in
detection, or e) a site can be involved in deadlock detection but in
no way involved in a deadlock. Not all of these possibilities exist
with each algorithm.
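
The cases a) through e) depend on three yes/no properties of a site, so a site can fall into more than one category. A minimal sketch of that classification follows; the function name `site_types` and its boolean arguments are assumptions introduced for illustration.

```python
from typing import Set


def site_types(has_transaction: bool, has_resource: bool,
               in_detection: bool) -> Set[str]:
    """Return the set of case labels (a-e) that apply to one site.

    has_transaction: a transaction at the site is involved in the deadlock
    has_resource:    a resource at the site is involved in the deadlock
    in_detection:    the site takes part in deadlock detection
    """
    types: Set[str] = set()
    if has_transaction:
        types.add("b" if in_detection else "a")
    if has_resource:
        types.add("d" if in_detection else "c")
    if in_detection and not has_transaction and not has_resource:
        types.add("e")
    return types


both = site_types(True, True, True)             # transaction and resource, detecting
resource_only = site_types(False, True, False)  # resource, not detecting
bystander = site_types(False, False, True)      # detecting only
```

A site holding both a deadlocked transaction and a deadlocked resource while taking part in detection is classified as both b and d.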


A. THE DISTRIBUTED DEADLOCK DETECTION ALGORITHM OF GOLDMAN.

In [GOL77], Goldman presents two deadlock detection
algorithms. Only the distributed version will be considered in this
paper. A Process Management Module (PMM) at each site handles resource
allocation and deadlock detection. An 'ordered blocked process list'
(OBPL) is a list of process names, each of which is waiting for access
to a resource assigned to the preceding process in the list. The
last process in the list is either waiting for access to the resource
named, or it has access to that resource. An OBPL is created each
time a PMM wants to see if a blocked process is involved in a
deadlock. In the distributed algorithm, an OBPL is passed from a PMM
to another PMM which has information, either about a resource or a
transaction in the OBPL, which is needed to expand the OBPL. Each PMM
adds the information it knows, and either detects a deadlock, detects
a non-deadlocked state, or passes the OBPL to another PMM for further
expansion. The terms process and transaction will be used
synonymously in the analysis of this DDA. If several transactions are
waiting on one transaction, multiple copies may be made of the OBPL and
sent to each site having one of those waiting transactions. Processes
can be in either of 2 states, active or blocked (waiting). A blocked
process could be waiting for a database object, message text from
another process or message text from an operator. A process is active
if it is not blocked. In the algorithm, PX and RX are temporary
variables representing a process or resource. The steps of the
algorithm are:


1.  Set RX to the value contained in the resource identification
    portion of the OBPL. If RX represents a local resource, go
    to 2. Otherwise, go to 3.

2.  Verify that the last process added to OBPL is still waiting
    for RX. If so, go to 3; otherwise, halt.

3.  Let PX be the process controlling RX. If PX is already in OBPL,
    then there is a deadlock. If not, go to 4.

4.  If PX is local to current PMM, go to 5, otherwise go to 7.

5.  If PX is active, there is no deadlock. Discard OBPL and
    halt. Otherwise go to 6.

6.  Add PX to OBPL and go to 10.

7.  Add PX and RX to OBPL. Send OBPL to PMM in site in which PX
    resides. Halt.

8.  Verify that last process in OBPL still has access to RX. If
    not, there is no deadlock, so discard OBPL and halt. If so,
    go to 9.

9.  If last process in OBPL is active, there is no deadlock, so
    discard OBPL and halt. Otherwise go to 10.

10. Call resource for which last process is waiting RX. If RX is
    local, go to 3. Otherwise go to 11.

11. Place RX in OBPL and send OBPL to PMM of site in which RX
    resides. Halt.
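
The eleven steps can be traced with a small single-process simulation. The sketch below is not Goldman's implementation: it keeps the whole system state in three dictionaries that every simulated PMM can read, it follows one OBPL only (no multiple copies), and it assumes that an OBPL sent in step 7 is processed from step 8 and one sent in step 11 from step 1. The name `goldman_trace`, its parameters, and the `lost` set used to drop messages are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Hop = Tuple[str, str]  # (sending site, receiving site)


def goldman_trace(start_site: str, blocked: str,
                  site_of: Dict[str, str], holder: Dict[str, str],
                  waiting: Dict[str, str],
                  lost: FrozenSet[Hop] = frozenset()) -> Tuple[str, List[Hop]]:
    """Walk steps 1-11 for one OBPL; return (outcome, OBPL hops)."""
    here, step = start_site, 10           # a new OBPL enters at step 10
    obpl: List[str] = [blocked]
    rx: Optional[str] = None
    px: Optional[str] = None
    hops: List[Hop] = []
    while True:
        if step == 1:                     # RX arrived inside the OBPL
            step = 2 if site_of[rx] == here else 3
        elif step == 2:                   # is the last process still waiting?
            if waiting.get(obpl[-1]) != rx:
                return "no deadlock", hops
            step = 3
        elif step == 3:
            px = holder.get(rx)
            if px is None:                # assumption: a free resource blocks nobody
                return "no deadlock", hops
            if px in obpl:
                return "deadlock", hops
            step = 4
        elif step == 4:
            step = 5 if site_of[px] == here else 7
        elif step == 5:
            if px not in waiting:         # PX is active
                return "no deadlock", hops
            step = 6
        elif step == 6:
            obpl.append(px)
            step = 10
        elif step in (7, 11):             # both steps send the OBPL and halt
            if step == 7:
                obpl.append(px)
            dest = site_of[px] if step == 7 else site_of[rx]
            hops.append((here, dest))
            if (here, dest) in lost:
                return "message lost", hops
            here, step = dest, (8 if step == 7 else 1)
        elif step == 8:                   # does the last process still hold RX?
            if holder.get(rx) != obpl[-1]:
                return "no deadlock", hops
            step = 9
        elif step == 9:
            if obpl[-1] not in waiting:
                return "no deadlock", hops
            step = 10
        elif step == 10:
            rx = waiting[obpl[-1]]
            step = 3 if site_of[rx] == here else 11


site_of = {"T1": "A", "T4": "D", "R2": "B", "R3": "C", "R4": "D"}
holder = {"R2": "T1", "R3": "T1", "R4": "T4"}
waiting = {"T1": "R4", "T4": "R3"}

fault_free = goldman_trace("C", "T4", site_of, holder, waiting)
one_loss = goldman_trace("C", "T4", site_of, holder, waiting,
                         lost=frozenset({("C", "A")}))
```

With the example state, the fault-free run ends in "deadlock" after the OBPL travels from site C to A and from A to D. Dropping the first of those messages ends the run with "message lost", and nothing in the eleven steps restarts detection.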


Figure 2 shows the actions taken at each site during the
execution of the DDA following the request by T4 for resource R3. The
numbers refer to the current step being executed by the DDA. As can
be seen, the algorithm correctly detected the resulting deadlock in
an environment of no faults. If, however, a message is lost (in our
example, either the OBPL sent from site C to A, or the OBPL sent from
A to D), the information necessary to detect the deadlock will be
lost, and the algorithm will fail to detect an existing deadlock.


Site C:
   10. Create OBPL with T4; set RX = R3.
    3. T1 controls R3, T1 not in OBPL.
    4. T1 not local.
    7. Add T1 and R3 to OBPL and send to site A.

Site A:
    8. T1 has access to R3.
    9. T1 waiting.
   10. Set RX = R4.
   11. Add R4 to OBPL, send to site D.

Site D:
    1. Set RX = R4.
    2. T1 waiting for R4.
    3. Set PX = T4. T4 already in OBPL, deadlock detected.


Figure 2


Goldman's algorithm allows the following types of sites
discussed previously: type b (a site can have a transaction involved in
deadlock and the site is involved in detection), type d (a site can
have a resource held by a transaction involved in deadlock and the
site will be involved in deadlock detection), and type c (a site can
have a resource held by a transaction involved in a deadlock and not
be involved in deadlock detection). A site could also be in several
of the categories above, depending on the complexity of the system
state. For example, site D could be considered a type b or type d
site. If a site of type b (sites A or D in our example) fails during
execution of the DDA, the behavior could be different depending on the
time of the failure. If the failure occurred at site A before site C
sent the OBPL to site A, site C would realize that site A had failed.
The algorithm includes no procedure for this occurrence, so the
behavior would be dependent on the underlying system. If the failure
at site A occurred after it received the OBPL, all deadlock detection
activity will cease, because only site A was currently involved in
deadlock detection. A system timeout mechanism would eventually abort
the transactions involved in the deadlock. A failure at site D would
have the same effect as at site A.

If a site of type d (site C in our example) failed, the time
of the failure would again determine the behavior of the DDA. If the
failure occurred before site C sent the OBPL to site A, deadlock
detection activity would cease without deadlock having been detected.
If the OBPL had been sent, however, deadlock detection would continue
at sites A and D (sequentially) with site D detecting a deadlock. The
failure of site C would not have been critical after the OBPL had been
sent. The failure of a type c site (site B in our example)
would have no effect on the behavior of the DDA, because the fact that
R2 is held by T1 is not used or known by the DDA at any site.

There are essentially two types of OBPL's created by this
DDA. The first type, call it W, is when a process is waiting, but is
not involved in a deadlock. This OBPL is subsequently discarded. The
second type, call it D, is one which will eventually show a deadlock
cycle. If there are n transactions involved in a deadlock cycle, this
DDA will create from 1 to n type D OBPL's. In our example, only one
was created. If the request by T1 for resource R4 had happened
simultaneously with the request by T4 for resource R3, two OBPL's would
have been created, which would have resulted in two sites independently
detecting the same deadlock, versus the one site in our example. Thus
the robustness of this algorithm with respect to a single site failure
is related to the ratio of the number of D type OBPL's created to the
number of transactions involved in the deadlock. This ratio is however
determined by the sequencing or timing of transactions, messages and
blocked resources. Such sequencing is of random nature. A ratio of 1
would provide the highest degree of robustness. When only a single
OBPL is created, the robustness of the DDA is very similar to that of
a centralized DDA; a single site failure can stop deadlock detection
activity. We conclude that the robustness of this DDA can be analyzed
but it cannot be predicted.
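
As a small worked example of the ratio discussed above, the sketch below computes it for the two-transaction deadlock of our example. The function name `obpl_ratio` is an assumption; the bounds check reflects the statement that between 1 and n type D OBPL's are created for a cycle of n transactions.

```python
from fractions import Fraction


def obpl_ratio(d_type_obpls: int, transactions_in_cycle: int) -> Fraction:
    """Ratio of type D OBPL's created to transactions in the deadlock cycle."""
    if not 1 <= d_type_obpls <= transactions_in_cycle:
        raise ValueError("between 1 and n type D OBPL's are created")
    return Fraction(d_type_obpls, transactions_in_cycle)


sequential = obpl_ratio(1, 2)     # our example: one OBPL, cycle of T1 and T4
simultaneous = obpl_ratio(2, 2)   # both requests issued at the same time
```

The sequential case gives 1/2 and the simultaneous case gives 1, the most robust value. Which case occurs is decided by timing, not by the algorithm.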

B. THE MENASCE-MUNTZ DISTRIBUTED ALGORITHM

In [MEN79], Menasce and Muntz presented a distributed
deadlock detection algorithm. Gligor and Shattuck [GLI80] presented a
counter example which showed the algorithm to be incorrect in that it
failed in some cases to detect a deadlock. They also proposed a
modification to the algorithm which they thought would make it
correct, but they felt the algorithm was impractical. In [TSA82],
Tsai and Belford show that the algorithm as modified by Gligor and
Shattuck is also incorrect. Nevertheless, we will investigate the
enhanced algorithm (i.e., its modified version as suggested by Gligor
and Shattuck) in the presence of errors.

The algorithm constructs a Transaction-Waits-For (TWF) graph
at originating sites of transactions which are potentially involved in
the deadlock being detected, and at sites at which some transaction
could not acquire a resource. Nodes in the TWF graphs represent
transactions. An edge (Ti,Tj) indicates that transaction Ti is waiting
for transaction Tj. A non-blocked transaction is a transaction that
is not waiting and is represented in the TWF graph by a node with no
outgoing arcs. A blocked transaction is waiting for some transaction
to finish. A 'blocking set' is defined as the set of all non-blocked
transactions which can be reached by following a directed path in the
TWF graph starting at the node associated with transaction T.
A pair (T,T') is a 'blocking pair' of T if T' is in the blocking set
of T. A 'potential blocking set' consists of all waiting transactions
that can be reached from T [GLI80]. Sorig(T) means the site of origin
of transaction T. Sk is the site currently executing the algorithm.
The rules which define the enhanced algorithm, as executed at site Sk,
are:

Rule 0: When a transaction T requests a nonlocal resource it is
marked 'waiting'.

Rule 1: The resource R at site Sk cannot be allocated to
transaction T because it is held by T1, ..., Tk. Add an arc from
T to each of the transactions T1, ..., Tk. If there is then a
cycle formed in the TWF graph, deadlock has been detected.
Otherwise, for each transaction T' in blocking set(T), send
the blocking pair (T,T') to Sorig(T) if Sorig(T) =/= Sk and
to Sorig(T') if Sorig(T') =/= Sk. Form a list of potential
blocking pairs associated with T.

Rule 2: A blocking pair (T,T') is received. Add an arc from T
to T' in the TWF graph. If a cycle is formed, then a
deadlock exists.

Rule 2.1: If T' is blocked and Sorig(T) =/= Sk, then for each
transaction T'' in the blocking set(T), send the blocking
pair (T,T'') to Sorig(T'') if Sorig(T'') =/= Sk.

Rule 2.2: If T is waiting and Sorig(T) = Sk, then for each
potential blocking pair (T'',T) send the blocking pair (T'',T)
to Sorig(T'') if Sorig(T'') =/= Sk. Then, discard the
potential blocking pairs (T'',T) and erase the 'waiting' mark of
T.
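
The two graph operations the rules rely on, computing a blocking set and testing whether a new arc closes a cycle, can be sketched as follows. This covers only the local TWF graph bookkeeping at one site, not the message exchange of rules 1 through 2.2. The dictionary representation and the names `blocking_set`, `reaches_itself` and `add_arc` are assumptions made for illustration.

```python
from typing import Dict, Set

TWF = Dict[str, Set[str]]  # transaction -> transactions it waits for


def blocking_set(twf: TWF, t: str) -> Set[str]:
    """Non-blocked transactions reachable from t along directed paths."""
    seen: Set[str] = set()
    result: Set[str] = set()
    stack = [t]
    while stack:
        node = stack.pop()
        for nxt in twf.get(node, ()):
            if nxt in seen:
                continue
            seen.add(nxt)
            if twf.get(nxt):          # nxt has outgoing arcs: it is blocked
                stack.append(nxt)
            else:                     # no outgoing arcs: non-blocked
                result.add(nxt)
    return result


def reaches_itself(twf: TWF, t: str) -> bool:
    """True if some directed path leads from t back to t."""
    seen: Set[str] = set()
    stack = [t]
    while stack:
        node = stack.pop()
        for nxt in twf.get(node, ()):
            if nxt == t:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def add_arc(twf: TWF, src: str, dst: str) -> bool:
    """Add the arc src -> dst; report whether it closes a cycle."""
    twf.setdefault(src, set()).add(dst)
    return reaches_itself(twf, src)


# A site that already knows T1 waits for T4 sees a cycle when T4 -> T1 arrives.
knows_t1_waits = {"T1": {"T4"}}
cycle_found = add_arc(knows_t1_waits, "T4", "T1")

# A site with an empty graph sees no cycle; T1 looks non-blocked to it.
empty_graph: TWF = {}
no_cycle = add_arc(empty_graph, "T4", "T1")
pairs_to_send = blocking_set(empty_graph, "T4")
```

The second case shows why blocking pairs are forwarded: a site whose graph holds only the arc T4 to T1 cannot see the deadlock and can only report that T1 blocks T4.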


Figure 3 shows the actions taken at each site during the
execution of the DDA following the request by T4 for resource R3. As
can be seen, the deadlock was correctly detected by site A, in the
absence of failures. If the request message (T4,R3) from site D to site
C was lost, however, deadlock detection activity would cease. If the
blocking pair (T4,T1) from site C to site D was lost, site A would
still detect the deadlock. If, however, the blocking pair (T4,T1) from
site C to site A was lost, site D would apply rule 2. Neither rule 2.1
nor 2.2 applies, so deadlock detection activity would cease.

Figure 3

This algorithm allows sites of types b, c, d and e, although
our example does not include a site of type e. If a type b site (one
having a transaction involved in the deadlock and the site is also
involved in detection) failed, in our example site A (or site D), the
behavior of the algorithm is dependent on the time of failure. If
site A failed before receiving the blocking pair (T4,T1), site C would
recognize the failure, but its action is not specified in the rules of
the DDA. Site D would not detect the deadlock for the same reason as
if the message from site C to site A was lost. If, however, the
failure occurred after site A received the blocking pair, deadlock
detection activity would continue (at site D) but deadlock would not
be detected. A failure of site D, also a type b site, at any time,
would have no effect on detecting the deadlock in this example. If a
type c site failed (site B), it would have no effect on detecting the
deadlock. If a type d site (site C) failed, the time of its failure
would determine the behavior of the DDA. If it failed before sending
the blocking pair to sites A and D, deadlock detection activity would
cease. If it failed after sending those messages, it would have no
effect on detecting the deadlock.

For our example, this algorithm behaved surprisingly
similarly to Goldman's algorithm in almost all types and timings of
failures. This may just be an anomaly found in small deadlock cycles,
because in longer and more complex scenarios, it would appear that
more sites would be involved in detection, and that there would be
some duplication of information. As the number of transactions (and
resources) involved in a deadlock cycle increases, more blocking pairs
and potential blocking pairs will be sent to more sites, i.e., the
number of sites detecting the deadlock increases with the number
of transactions involved in the deadlock and with the deadlock
topology (or complexity). Thus there will be more chance of a deadlock
being detected, as more parallel detection activity will be in
progress. It appears, then, that as the size and complexity of deadlock
increases, the robustness of this algorithm increases. However, the
effect, pointed out by Gligor and Shattuck, of rule 2.2 discarding
information too early may have some impact on the increased
robustness.

C.  OBERMARCK'S  DISTRIBUTED  DEADLOCK  DETECTION  ALGORITHM. 

In [OBE80], Obermarck presents a distributed deadlock detection
algorithm. A centralized algorithm is presented by Obermarck and
Beeri in [BEE81], but it is not discussed here because no mention is
made in that paper about a backup capability if the site containing
the centralized deadlock detector fails. Obermarck's distributed
algorithm constructs a transaction-waits-for (TWF) graph at each site.
Each site conducts deadlock detection simultaneously, passing
information to one other site. Deadlock detection activity at a site
may become temporarily inactive until receipt of new information from
another site. Obermarck states that in actual practice, synchronization
(not necessarily precise) between sites would be roughly controlled
by an agreed-upon interval between deadlock detection iterations,
and by timestamps on transmitted messages. Nodes in the graph
represent transactions, and edges represent a
transaction-waits-for-transaction (TWFT) situation. A 'string' is a
list of TWFT information which is sent from one site to one or more
sites. A transaction may migrate from site to site, in which case an
'agent' represents the transaction at the new site(s). A communication
link is also established between agents of a transaction. These
communication links are represented by a node called 'external' (EX).
An agent which is expected to send a message is shown in the TWF graph
by EX -> T, while an agent waiting to receive is shown by T -> EX.
Although Obermarck's algorithm includes the resolution of deadlocks,
only the detection part will be considered in this paper. Transaction
IDs are network unique names for transactions, and are lexically
ordered (for example, T1 < T2 < T3). The steps performed at each site
are:

1.  Build a TWF graph using transaction to transaction wait-for
    relationships.

2.  Obtain and add to the existing TWF graph any 'strings'
    transmitted from other sites.

    a.  For each transaction identified in a string, create a
        node in the TWF graph if none exists in this site.

    b.  For each transaction in the string, starting with the
        first (which is always 'external'), create an edge to the
        node representing the next transaction in the string.

3.  Create wait-for edges from 'external' to each node representing
    a transaction's agent which is expected to send on a
    communication link.

4.  Create a WF edge from each node representing a transaction's
    agent which is waiting to receive from a communication link,
    to 'external'.

5.  Analyze the graph for cycles.

6.  After resolving all cycles not involving 'external', if the
    transaction ID of the node for which 'external' waits is
    greater than the transaction ID of the node waiting for
    'external', then

    a.  Transform the cycle into a string which starts with
        'external', followed by each transaction ID in the cycle,
        ending with the transaction ID of the node waiting for
        'external'.

    b.  Send the string to each site for which the transaction
        terminating the string is waiting to receive.
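
The per-site steps above can be sketched in code. The following is a
minimal illustration only, not Obermarck's implementation: the names
build_twf, elementary_cycles and detection_step are hypothetical,
transaction IDs are assumed to compare lexically as plain strings, and
cycles that do not involve 'external' are merely reported as deadlocks
(resolution and the choice of destination site in step 6b are left out).

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

EX = "EX"  # the special 'external' node
Graph = Dict[str, Set[str]]


def build_twf(local_waits: Iterable[Tuple[str, str]],
              strings: Iterable[Sequence[str]],
              senders: Iterable[str],
              receivers: Iterable[str]) -> Graph:
    """Steps 1-4: build one site's TWF graph as adjacency sets."""
    graph: Graph = {}

    def add(u: str, v: str) -> None:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set())

    for u, v in local_waits:        # step 1: local wait-for relationships
        add(u, v)
    for s in strings:               # step 2: strings received from other sites
        for u, v in zip(s, s[1:]):
            add(u, v)
    for t in senders:               # step 3: 'external' waits for agents expected to send
        add(EX, t)
    for t in receivers:             # step 4: agents waiting to receive wait for 'external'
        add(t, EX)
    return graph


def elementary_cycles(graph: Graph) -> List[List[str]]:
    """Step 5: list each elementary cycle once, rooted at its smallest node."""
    cycles: List[List[str]] = []

    def dfs(start: str, node: str, path: List[str]) -> None:
        for nxt in sorted(graph[node]):
            if nxt == start:
                cycles.append(path[:])
            elif nxt > start and nxt not in path:
                path.append(nxt)
                dfs(start, nxt, path)
                path.pop()

    for start in sorted(graph):
        dfs(start, start, [start])
    return cycles


def detection_step(graph: Graph) -> Tuple[List[List[str]], List[Tuple[str, ...]]]:
    """Steps 5-6: return (deadlock cycles, strings to transmit)."""
    deadlocks: List[List[str]] = []
    strings: List[Tuple[str, ...]] = []
    for cyc in elementary_cycles(graph):
        if EX not in cyc:
            deadlocks.append(cyc)           # a cycle without 'external' is a deadlock
            continue
        i = cyc.index(EX)
        cyc = cyc[i:] + cyc[:i]             # rotate so the cycle starts at 'external'
        first, last = cyc[1], cyc[-1]       # EX -> first ... last -> EX
        if first > last:                    # lexical test of step 6
            strings.append(tuple(cyc))
    return deadlocks, strings
```

With the initial conditions of our example, build_twf([("T4", "T1")],
[], ["T4"], ["T1"]) models site C, and detection_step on that graph
yields the single string (EX,T4,T1) and no deadlock. Feeding that
string into site D's graph (local wait T1 -> T4) yields the cycle
T1, T4 with no 'external' node, i.e. a detected deadlock.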


In his proof of correctness, Obermarck shows how the algorithm
can detect false deadlocks because a string received at a site
may no longer be valid when it is used. He discusses two methods of
handling false deadlocks: treat them as actual deadlocks (if they don't
occur too often), or verify them by sending them around the network
and having each site verify them.
Figure 4 shows a global picture of the system, including the
communication links established between agents, for the initial
conditions of our example. The agents of T1 at sites B and C have
performed work (used R2 and R3), and are waiting for the next request
from T1 at site A. T1 at site A is waiting for its agent at site D,
which is in resource-wait for T4. Figure 5 shows the actions of this
algorithm in an environment of no errors. As can be seen, it
successfully detects the deadlock.

Site C:     T4 requests R3. An agent of T4 is formed.

All sites:  1,3,4: Each site starts deadlock detection and builds
            its WF graph.

               Site A:  T1 -> EX -> T1
               Site B:  T1 -> EX
               Site C:  EX -> T4 -> T1 -> EX
               Site D:  EX -> T1 -> T4 -> EX

            5: List elementary cycles.

               Site A:  T1 -> EX -> T1
               Site C:  EX -> T4 -> T1 -> EX
               Site D:  EX -> T1 -> T4 -> EX

Site C:     6: Form string (EX,T4,T1). Send to A.

Site A:     2: Add string to WF graph: EX -> T4 -> T1 -> EX.
            6: Form string (EX,T4,T1). Send to D.

Site D:     2: Add string to WF graph: T1 -> T4 -> T1.
            5: Deadlock detected.

                          Figure 5

Obermarck assumes that messages sent are received. This is
essential to the correctness of this DDA, because it is easy to see
what happens if a message is lost. If the string (EX,T4,T1) from site
C to A were lost, deadlock detection activity would
cease without detecting the deadlock. The use of agents to represent
transactions which have migrated to other sites allows this DDA to have
nodes of types a or b, if we substitute 'agents' for 'transactions' in
our definitions at the beginning of this section. Site B would be an
example of a type a site, while the other three sites would all be
type b sites.

[...] the DDA. A failure at sites A, C or D would either have no effect, an
undetermined effect, or cause deadlock detection activity to cease,
depending on the time of the failure. For example, if site C failed
before sending the string (EX,T4,T1) to site A, deadlock detection
activity would cease. If site A (or D) failed before the string
(EX,T4,T1) was sent to it, the transmitting site would recognize the
failure, but its action in that eventuality is not included in the
steps of the DDA. If site C failed after sending the string, the
detection activity would continue, and the deadlock would be detected.
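
The failure cases just described can be summarized in a small model.
This is only a sketch of our reading of the example, not part of the
DDA: the function name outcome, its parameters and the three result
labels are hypothetical, and the chain of transmissions C -> A -> D is
taken from Figure 5. A failure of the final detecting site after it has
received the string is not discussed in the text and is reported here
as "unspecified".

```python
from typing import Optional, Sequence, Tuple

Hop = Tuple[str, str]  # (sending site, receiving site)


def outcome(hops: Sequence[Hop],
            lost_hop: Optional[int] = None,
            failed_site: Optional[str] = None,
            fails_before_hop: int = 0) -> str:
    """Walk the chain of string transmissions under one fault.

    failed_site (if any) goes down just before hop number
    fails_before_hop is attempted and stays down afterwards.
    """
    for i, (sender, receiver) in enumerate(hops):
        site_is_down = failed_site is not None and i >= fails_before_hop
        if site_is_down and sender == failed_site:
            return "ceased"        # the string is never sent
        if site_is_down and receiver == failed_site:
            return "unspecified"   # sender notices, but the DDA gives no rule
        if lost_hop == i:
            return "ceased"        # nothing retransmits a lost string
    if failed_site == hops[-1][1]:
        return "unspecified"       # detector failed after receipt: not covered
    return "detected"
```

For the chain [("C", "A"), ("A", "D")], a lost first message or a
failure of site C before it sends gives "ceased", a failure of site A
before the string reaches it gives "unspecified", and a failure of
site C after it has sent gives "detected".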
This DDA appears to be potentially more robust than the previous
two. Each site contains and retains more information in its WF
graph, and all sites start detection activity simultaneously, and
potentially stay involved for the entire detection process. The use
of the lexical ordering of nodes was for optimization of the number of
messages transmitted. If this constraint were lifted, the strings
would be sent to all sites involved from all sites in which a cycle
existed. In our example, this would have allowed sites A and D to
simultaneously detect deadlock. The DDA would be clearly more robust,
but the overhead would be greater. In its existing form, this DDA's
robustness is similar to the previous algorithms because it is
essentially sequentially detecting the deadlock.


D.  THE ALGORITHM OF TSAI AND BELFORD.

In [TSA82], Tsai and Belford present a distributed deadlock
detection algorithm. They utilize a "Reduced Transaction-Resource"
(RTR) graph, which contains only a subset of the transaction resource
graph, but has all relevant TWF edges. Nodes in the RTR graph can be
transactions or resources. The algorithm uses a concept the authors
call a "reaching pair", which is the basic unit of information passed
from site to site. If a path TiTj...Tn can be formed by following TWF
edges, and if there is a request edge (Tn,Rm), then Ti "reaches" Rm,
and (Ti,Rm) is a "reaching pair." Five types of messages are sent
between sites: reaching messages, nonlocal request messages,
allocation messages, release-request messages, and releasing messages.
The nonlocal request messages include a list of all resources currently
held by the requesting transaction. Five different types of edges are
distinguished in the RTR graph: requesting edges, allocation edges,
TWF edges, resource reaching edges and transaction reaching edges. A
global timestamp is also used to establish an ordering of events.
This timestamp is used on allocation, request and reaching messages,
and on allocation and reaching edges in the RTR graph. The notation
used in the algorithm is:

TS(M):  timestamp  of  a  message 

TS(C):  current  system  time 

TS(A):  timestamp  of  an  allocation  edge 

TS(R):  timestamp of a reaching edge

=/=:  not  equal  to 

Sorig:  Site  of  origin 
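
The "reaching pair" definition given above can be illustrated with a
short sketch. It is not taken from [TSA82]: the function name
reaching_pairs and the edge-list representation are hypothetical, and
we assume that the trivial path (Ti = Tn) counts, so that a transaction
reaches the resource it requests itself.

```python
from typing import Dict, Iterable, Set, Tuple


def reaching_pairs(twf_edges: Iterable[Tuple[str, str]],
                   request_edges: Iterable[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Return every pair (Ti, Rm) such that Ti reaches Rm.

    Ti reaches Rm if a path Ti ... Tn exists along TWF edges and there
    is a request edge (Tn, Rm).
    """
    succ: Dict[str, Set[str]] = {}
    transactions: Set[str] = set()
    for u, v in twf_edges:
        succ.setdefault(u, set()).add(v)
        transactions.update((u, v))
    requests: Dict[str, Set[str]] = {}
    for t, r in request_edges:
        requests.setdefault(t, set()).add(r)
        transactions.add(t)

    pairs: Set[Tuple[str, str]] = set()
    for ti in transactions:
        seen = {ti}
        stack = [ti]
        while stack:                       # depth-first walk along TWF edges
            tn = stack.pop()
            for r in requests.get(tn, ()):
                pairs.add((ti, r))         # Tn requests r, so Ti reaches r
            for nxt in succ.get(tn, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return pairs
```

For example, with the single TWF edge (T1,T4) and the request edge
(T4,R3), both (T1,R3) and (T4,R3) are reaching pairs.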

The steps of the algorithm (as executed at site Sk) are:

Step 1:  {A transaction T enters the system requesting a nonlocal
         resource R} Add request edge (T,R) to RTR graph. Send
         request message (T,R',R,TS) to Sorig(R), where R' is the set
         of all resources allocated to T, and TS(M) = TS(C). R' has
         each TS(A) attached, and R' is empty if T holds no
         resources.

Step 1a: {A transaction T releases a nonlocal resource R} Erase edge
         (R,T) in the RTR graph. Send a release-request message (R,T)
         to Sorig(R).

Step 2:  {A transaction T enters system requesting local resource R}
         Go to step 4.

Step 2a: {A transaction T releases a local resource R} Erase edge
         (R,T) in RTR graph. If there is any transaction T' waiting
         for R, then begin

            Add allocation edge (R,T') to RTR graph with TS(A) =
            TS(C). Send allocation message (R,T',TS) with TS(M) =
            TS(C) to Sorig(T') if Sorig(T') =/= Sk. end.

Step 3:  {A request message (T,R',R,TS) is received} Add allocation
         edges (Ri,T) for each Ri in R' to RTR graph. Go to step 4.

Step 3a: {A release-request message (R,T) is received} Erase
         allocation edge (R,T) in RTR graph. Send releasing message
         (R,T) to Sorig(T). If there is any transaction T' waiting
         for R, then begin

            Add allocation edge (R,T') to RTR graph with TS(A) =
            TS(C). Send allocation message (R,T',TS) to Sorig(T')
            if Sorig(T') =/= Sk. end.

Step 4:  If R is not held by any transaction, then begin

            Add allocation edge (R,T) with TS(A) = TS(C) to RTR
            graph. If Sorig(T) =/= Sk, then send an allocation
            message (R,T,TS) with TS(M) = TS(C) to Sorig(T). end.
         else begin

            Add requesting edge (T,R) to RTR graph. Suppose R is
            held by transaction T'. Add edge (T,T') to RTR graph.
            If there is a cycle, deadlock has been detected, else
            go to step 5. end.

Step 5:  {reaching message generation step} If there are two edges
         (T,R) and (T,T') added to the graph, and if TT'...T" is any
         path obtained by following the TWF and transaction reaching
         edges, then set X = R" if T" has outgoing edge to R", else
         set X = R. For all transactions Ti in RTR graph reaching X
         via T, do begin

            If Ti holds any resource R' with Sorig(Ti) =/=
            Sorig(R') and Sorig(R') =/= Sk, then send a reaching
            message (Ti,X,TS) to Sorig(R'). If Sorig(Ti) =/= Sk
            and Ti =/= T, then send a reaching message (Ti,X,TS) to
            Sorig(Ti). If Sorig(Ti) =/= Sk and Ti = T and X = R"
            then send a reaching message (Ti,X,TS) to Sorig(Ti).
            The TS in the reaching message is set to TS(C) if
            triggered by a local request, and set to TS(M) of the
            nonlocal request or reaching message otherwise.

Step 6:  {An allocation message (R,T,TS) is received} If R is an entry
         in the graph, then begin

            Erase allocation edge (R,T') and all reaching edges
            (T",R) with TS(R) < TS(M) and the corresponding TWF
            edge (T,T') and transaction reaching edges (T",T'), if
            they exist, where T' =/= T. Change requesting edge
            (T,R) to allocation edge (R,T) with TS(A) = TS(M) if
            (T,R) exists, and for each resource reaching edge
            (T",R), add the transaction reaching edge (T",T). If
            Sorig(T) = Sk, wake up transaction T. end.

Step 6a: {A releasing message (R,T) is received} If Sorig(T) = Sk,
         wake up transaction T.

Step 7:  {A reaching message (T,R,TS) is received} If there exists an
         allocation edge (R,T') in the graph with TS(M) < TS(A) and
         T' =/= T, then skip this step, else begin

            Add resource reaching edge (T,R) to the RTR graph. If
            R is held by transaction T', then add the transaction
            reaching edge (T,T') to the graph. If there is a cycle
            in the graph, there is deadlock (go to step 8),
            otherwise go to step 5. end.

Step 8:  {A deadlock has been detected} Take appropriate action.
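
Steps 3 and 4, as executed at the site that owns the requested
resource, can be sketched as follows. This is a simplified
illustration under our own assumptions, not the authors' code: the
class RTRGraph and its fields are hypothetical, no messages are
actually sent, step 5 is omitted, and requesting edges and resource
reaching edges are kept in one table because both point from a
transaction to a resource.

```python
from typing import Dict, Iterable, Set, Tuple


class RTRGraph:
    """A site's RTR graph, reduced to what steps 3 and 4 need."""

    def __init__(self) -> None:
        # allocation edges (R,T) with TS(A): resource -> (holder, timestamp)
        self.holder: Dict[str, Tuple[str, int]] = {}
        # requesting / resource reaching edges (T,R): transaction -> resources
        self.wants: Dict[str, Set[str]] = {}

    def _cycle_through(self, t: str) -> bool:
        """True if following T -> R -> holder(R) from t leads back to t."""
        stack, seen = [t], set()
        while stack:
            cur = stack.pop()
            for r in self.wants.get(cur, ()):
                if r not in self.holder:
                    continue
                nxt = self.holder[r][0]
                if nxt == t:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    def on_request(self, t: str, held: Iterable[Tuple[str, int]],
                   r: str, now: int) -> str:
        """Handle request message (T,R',R,TS); `held` is R' with each TS(A)."""
        for ri, ts_a in held:               # step 3: allocation edges (Ri,T)
            self.holder[ri] = (t, ts_a)
        if r not in self.holder:            # step 4, R is free
            self.holder[r] = (t, now)       # allocation edge with TS(A) = TS(C)
            return "allocated"
        self.wants.setdefault(t, set()).add(r)   # step 4, requesting edge (T,R)
        return "deadlock" if self._cycle_through(t) else "waiting"
```

As a usage example, suppose site C already records that T1 holds R3
and that T1 reaches R4 (an assumption about the starting graph). The
request message (T4,{R4},R3,t6) then supplies the allocation edge
(R4,T4) through R', and on_request("T4", [("R4", 2)], "R3", 6)
returns "deadlock". This is why the list R' travels with the request.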


Figure 6 shows the starting WF graphs and the actions of the
DDA resulting from the request by transaction T4 for resource R3. An
important item to note is that as soon as the request is made, step 1
adds sufficient information to the WF graph to detect a deadlock, but
does not check for deadlock, so the request is sent to site C and the
algorithm continues. The obvious thing to do would be to add a check
for a deadlock cycle in step 1, but on closer analysis, this check
may lead to detection of false deadlocks (if, for example, T1 had just
released R3 but the message had not yet been received by site D).
Therefore the algorithm in its present form will be analyzed. The
only message sent by this algorithm in this example is the request
message (T4,{R4},R3,t6). If it were lost, the current algorithm would
cease detection activity without detecting deadlock. In this
instance, if the algorithm checked for deadlock in step 1, it would have
been detected with no messages required.
Figure 6
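
The trade-off of adding a cycle check to step 1 can be made concrete
with a small sketch. The edge sets below are our assumption about what
site D holds in this example (Figure 6 itself is not reproduced here),
and the helper closes_cycle is hypothetical. Request and reaching
edges point from a transaction to a resource, allocation edges from a
resource to its holder.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def closes_cycle(edges: Iterable[Edge], new_edge: Edge) -> bool:
    """True if adding new_edge (u, v) creates a directed cycle,
    i.e. if u is already reachable from v."""
    succ: Dict[str, Set[str]] = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    u, v = new_edge
    stack, seen = [v], {v}
    while stack:
        cur = stack.pop()
        if cur == u:
            return True
        for nxt in succ.get(cur, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


# Site D's (assumed) view when T4 requests R3:
# T4 holds R4, T1 requests R4, and T1 is believed to hold R3.
view_at_d = [("R4", "T4"), ("T1", "R4"), ("R3", "T1")]

# The true state if T1 has just released R3 and site D has not yet heard.
true_state = [("R4", "T4"), ("T1", "R4")]

step1_check = closes_cycle(view_at_d, ("T4", "R3"))     # check on D's view
really_deadlocked = closes_cycle(true_state, ("T4", "R3"))
```

On site D's view the check reports a cycle without any message being
sent, which is the robustness gain. If the view is stale in the way
described above, the same check still reports a cycle although none
exists in the true state, which is the false deadlock.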


For this DDA, sites can be of type b, d or e. Sites A and D
are type b and sites B and C are type d. This example has no type e
sites, but step 5 of the algorithm could send reaching messages to
sites not involved at all. Those sites would execute a step or two of
the algorithm, but not be intimately involved in the actual deadlock
detection. In this example, a failure of sites A or B (types b and d
respectively) would have no effect on the detection of the deadlock.
The effect of a failure of site C before the reaching message was sent
to it cannot be determined because the DDA includes no instructions
for that event. A failure of site C after receiving the reaching
message [...] site G at any time would have no effect on deadlock
detection. The timing of the failure would also determine the
behavior of the DDA if site D failed. If site D failed before sending
the request message, detection activity would cease, while if the
message had been sent, deadlock would still be detected.

For our example, this DDA appears to be at about the same level of
robustness as the other algorithms, except that each site contains and
retains more information than in other DDAs. This indicates that it
should be more robust. The algorithm in the case of our example was
able to detect the deadlock with only the resource request message.
As deadlock cycles become more complex, it appears that this algorithm
will also become more robust, even more so than Obermarck's, because
this DDA retains more information, and it will send reaching messages
to any site potentially involved in the deadlock. Detection activity
will occur simultaneously in those sites receiving reaching messages.
The inclusion of a cycle detection in step 1 may have
adverse effects on the correctness, but it might greatly enhance the
robustness of the DDA.

IV.  CONCLUSIONS 

The algorithms discussed in the previous section can be loosely
ranked by their robustness. Goldman's algorithm is the least robust,
because it is always executed sequentially (unless the requests occur
simultaneously, as discussed previously). Thus it is always dependent
on a single node. Obermarck's algorithm starts deadlock detection
simultaneously at all sites, and subsequently passes information in a
lexical manner because of the message optimization. For our example,
this resulted in a sequential detection, although for larger deadlock
cycles, it should have some parallel detection activity occurring. The
Menasce-Muntz algorithm starts detection at the site where the
deadlock occurred, and deadlock detection is subsequently conducted at
sites which are potentially involved. The Tsai-Belford algorithm is
invoked each time a resource is requested. Deadlock detection can
occur concurrently at all sites potentially involved in the cycle.
It appears more robust than the Menasce-Muntz algorithm because more
information is held at each site.

Our analysis supports the rather obvious conclusion that robustness
has its cost: greater robustness is bought with greater overhead. The
Tsai-Belford algorithm appears more robust than Obermarck's algorithm,
for example, but it maintains larger WF graphs at each site, and is
invoked each time a resource is requested, in order that the WF graphs
contain sufficient information.

For the example we used to analyze the four algorithms in section
III, the behavior of each of those algorithms in the presence of errors
is almost identical. Because our deadlock cycle only involved 2
transactions, those algorithms which are potentially more robust in the
presence of larger cycles did not have time to develop their
robustness. In other words, for a short deadlock cycle, all the algorithms
converged within approximately the same length of time (two or three
iterations). Short cycles of length 2 or 3 are more probable in
existing applications, so all the above algorithms are approximately
equally robust in current applications. In future applications
(information utility programs, for example), however, we expect a much
higher probability of more complex deadlock cycles, which will require a
more robust DDA. Conversely, however, as the number of transactions (and
sites) increases, it will be important to use a minimum cost DDA.
Work is currently in progress on a new robust distributed deadlock
detection algorithm.